-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG-26988 implement replace for categorical blocks #27026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #27026 +/- ##
==========================================
- Coverage 91.99% 91.98% -0.02%
==========================================
Files 180 180
Lines 50774 50790 +16
==========================================
+ Hits 46712 46719 +7
- Misses 4062 4071 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #27026 +/- ##
==========================================
- Coverage 93.07% 91.86% -1.22%
==========================================
Files 192 180 -12
Lines 49562 50833 +1271
==========================================
+ Hits 46130 46697 +567
- Misses 3432 4136 +704
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -585,6 +585,7 @@ Categorical | |||
|
|||
- Bug in :func:`DataFrame.at` and :func:`Series.at` that would raise exception if the index was a :class:`CategoricalIndex` (:issue:`20629`) | |||
- Fixed bug in comparison of ordered :class:`Categorical` that contained missing values with a scalar which sometimes incorrectly resulted in ``True`` (:issue:`26504`) | |||
- Bug in :func:`DataFrame.replace` and :func:`Series.replace` that would give unusual results on categorical data (:issue:`26988`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"unusual" --> "incorrect"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/internals/blocks.py
Outdated
@@ -2978,6 +2978,29 @@ def where(self, other, cond, align=True, errors='raise', | |||
axis=axis, transpose=transpose) | |||
return result | |||
|
|||
def replace(self, to_replace, value, inplace=False, filter=None, | |||
regex=False, convert=True): | |||
if filter is None or not self.mgr_locs.isin(filter).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this block looks unrelated to the linked issue. what case is it solving? needs a separate test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first case is used when the replace function applies to the entire block, so the replace can be carried out just by manipulating categories. Most calls to Series.replace
should be covered with this case (including the new test in series/test_replace.py
). If there's a filter, this might not be possible so we fix the categories then use the original implementation (second case)
I added a test in frame/test_replace.py
to cover the second case, where a filter is used. It improves upon the previous behaviour, which would have casted the frame to object
Failure unrelated #26842 |
pandas/tests/frame/test_replace.py
Outdated
# GH 26988 | ||
df = DataFrame([[1, 1], [2, 2]], columns=['a', 'b'], dtype='category') | ||
expected = DataFrame(final_data, columns=['a', 'b'], dtype='category') | ||
expected['a'].cat.set_categories([1, 2, 3], inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try not to use inplace in tests, its non-idiomatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the merge conflict. LGTM.
Are there tests for the non-None |
@jbrockmendel yes, the dataframe tests take that branch while the series tests do not |
If it goes on Categorical, it should be generalized to EA. |
probably. But I think its much better then jamming things in the blocks (now we already have replace on the blocks, this PR might be ok actually), but I don't fully remember. |
I don't think we (or downstream EA authors) can reliably implement ExtensionArray.replace since it's such a complicated method. We're also relying on Block.replace to do the heavy lifting. An ExtensionArray.replace wouldn't have that luxury. |
can you merge master |
will take a look this afternoon |
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -88,6 +88,7 @@ Categorical | |||
^^^^^^^^^^^ | |||
|
|||
- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`) | |||
- Bug in :func:`DataFrame.replace` and :func:`Series.replace` that would give incorrect results on categorical data (:issue:`26988`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"func" -> "meth"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/arrays/categorical.py
Outdated
cat.remove_categories(to_replace, inplace=True) | ||
else: | ||
index = categories.index(to_replace) | ||
categories[index] = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if value
is already in categories?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed with new case
Only major question is whether we need to add |
pandas/core/arrays/categorical.py
Outdated
inplace = validate_bool_kwarg(inplace, "inplace") | ||
cat = self if inplace else self.copy() | ||
categories = cat.categories.tolist() | ||
if to_replace in categories: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if to_replace
is not hashable? I think at the Categorical level we are OK raising, but at the Block level we would cast to object right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would casting to object help?
>>> import pandas as pd
>>> s=pd.Series([1,2]).astype(object)
>>> s.replace(1,[1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pandas/core/series.py", line 4303, in replace
method=method,
File "pandas/core/generic.py", line 6819, in replace
raise TypeError(msg) # pragma: no cover
TypeError: Invalid "to_replace" type: 'int'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I've merged master and fixed all comments except this one for the reason above
this looked pretty good, can you merge master and address any remaining comments. |
can you merge master |
@JustinZhengBC can you merge master. itd be nice to get this in |
@jbrockmendel done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question about an error condition
@@ -2471,6 +2471,51 @@ def isin(self, values): | |||
code_values = code_values[null_mask | (code_values >= 0)] | |||
return algorithms.isin(self.codes, code_values) | |||
|
|||
def replace(self, to_replace, value, inplace: bool = False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type these more specifically (can be a followon PR)
@jreback how should parameters like |
right, but they are something like Union[Scalar, AnyArrayLike] I think? (again this is likely tricky so should do as a followup) |
Fixed the merge conflict. I think we decided to leave typing as a followup, so this should be good to go? |
result.values.add_categories(value, inplace=True) | ||
return super(CategoricalBlock, result).replace( | ||
to_replace, value, inplace, filter, regex, convert | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can any of the logic in this method generalize to ExtensionBlock? (presumably we'd need to add replace
to the EA interface?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good point, but we can certainly do as a followup; can you open an issue for discussion.
@JustinZhengBC if you can address @jbrockmendel comments to add some comments this lgtm and we can merge. (alternatvely @jbrockmendel could do as a followup) |
@jreback @jbrockmendel I added the comment |
Seems like this is good to merge. |
thanks for the patch @JustinZhengBC very nice! |
git diff upstream/master -u -- "*.py" | flake8 --diff
The replace functions used in the code path that caused the bug report appeared to densify the series. This PR adds a replace function to
CategoricalBlock
that takes advantage of the categorical data type when possible